speech production
Limit cycles for speech
Gafos, Adamantios I., Kuberski, Stephan R.
Rhythmic fluctuations in acoustic energy and accompanying neuronal excitations in cortical oscillations are characteristic of human speech, yet whether a corresponding rhythmicity inheres in the articulatory movements that generate speech remains unclear. The received understanding of speech movements as discrete, goal-oriented actions struggles to make contact with the rhythmicity findings. In this work, we demonstrate that an unintuitive -- but no less principled than the conventional -- representation for discrete movements reveals a pervasive limit cycle organization and unlocks the recovery of previously inaccessible rhythmic structure underlying the motor activity of speech. These results help resolve a time-honored tension between the ubiquity of biological rhythmicity and discreteness in speech, the quintessential human higher function, by revealing a rhythmic organization at the most fundamental level of individual articulatory actions.
The Dynamic Articulatory Model DYNARTmo: Dynamic Movement Generation and Speech Gestures
The neural generation and control of speech utterances is a complex process that is still not fully understood. However, several neurobiologically inspired models have been proposed that describe the hierarchical control concept of utterance generation (e.g., Hickok and Poeppel (2012); Bohland et al. (2010); Kröger et al. (2020); Parrell et al. (2018)). This process begins with the neural activation of the cognitive-linguistic representation of an utterance, followed by a higher-level premotor representation, leading to neuromuscular activation patterns, and finally to the articulatory-acoustic realization of the utterance (cf.
Speech Emotion Recognition with Phonation Excitation Information and Articulatory Kinematics
Zhang, Ziqian, Huang, Min, Xiao, Zhongzhe
Speech emotion recognition (SER) has advanced significantly for the sake of deep-learning methods, while textual information further enhances its performance. However, few studies have focused on the physiological information during speech production, which also encompasses speaker traits, including emotional states. To bridge this gap, we conducted a series of experiments to investigate the potential of the phonation excitation information and articulatory kinematics for SER. Due to the scarcity of training data for this purpose, we introduce a portrayed emotional dataset, STEM-E2VA, which includes audio and physiological data such as electroglottography (EGG) and electromagnetic articulography (EMA). EGG and EMA provide information of phonation excitation and articulatory kinematics, respectively. Additionally, we performed emotion recognition using estimated physiological data derived through inversion methods from speech, instead of collected EGG and EMA, to explore the feasibility of applying such physiological information in real-world SER. Experimental results confirm the effectiveness of incorporating physiological information about speech production for SER and demonstrate its potential for practical use in real-world scenarios.
DYNARTmo: A Dynamic Articulatory Model for Visualization of Speech Movement Patterns
We present DYNARTmo, a dynamic articulatory model designed to visualize speech articulation processes in a two-dimensional midsagittal plane. The model builds upon the UK-DYNAMO framework and integrates principles of articulatory underspecification, segmental and gestural control, and coarticulation. DYNARTmo simulates six key articulators based on ten continuous and six discrete control parameters, allowing for the generation of both vocalic and consonantal articulatory configurations. The current implementation is embedded in a web-based application (SpeechArticulationTrainer) that includes sagittal, glottal, and palatal views, making it suitable for use in phonetics education and speech therapy. While this paper focuses on the static modeling aspects, future work will address dynamic movement generation and integration with articulatory-acoustic modules.
MiSTR: Multi-Modal iEEG-to-Speech Synthesis with Transformer-Based Prosody Prediction and Neural Phase Reconstruction
Al-Radhi, Mohammed Salah, Németh, Géza, Gerazov, Branislav
Speech synthesis from intracranial EEG (iEEG) signals offers a promising avenue for restoring communication in individuals with severe speech impairments. However, achieving intelligible and natural speech remains challenging due to limitations in feature representation, prosody modeling, and phase reconstruction. We introduce MiSTR, a deep-learning framework that integrates: 1) Wavelet-based feature extraction to capture fine-grained temporal, spectral, and neurophysiological representations of iEEG signals, 2) A Transformer-based decoder for prosody-aware spectrogram prediction, and 3) A neural phase vocoder enforcing harmonic consistency via adaptive spectral correction. Evaluated on a public iEEG dataset, MiSTR achieves state-of-the-art speech intelligibility, with a mean Pearson correlation of 0.91 between reconstructed and original Mel spectrograms, improving over existing neural speech synthesis baselines.
Towards disentangling the contributions of articulation and acoustics in multimodal phoneme recognition
Foley, Sean, Nguyen, Hong, Lee, Jihwan, Kadiri, Sudarsana Reddy, Byrd, Dani, Goldstein, Louis, Narayanan, Shrikanth
Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.
Articulatory Feature Prediction from Surface EMG during Speech Production
Lee, Jihwan, Huang, Kevin, Avramidis, Kleanthis, Pistrosch, Simon, Gonzalez-Machorro, Monica, Lee, Yoonjeong, Schuller, Björn, Goldstein, Louis, Narayanan, Shrikanth
We present a model for predicting articulatory features from surface electromyography (EMG) signals during speech production. The proposed model integrates convolutional layers and a Transformer block, followed by separate predictors for articulatory features. Our approach achieves a high prediction correlation of approximately 0.9 for most articulatory features. Furthermore, we demonstrate that these predicted articulatory features can be decoded into intelligible speech waveforms. To our knowledge, this is the first method to decode speech waveforms from surface EMG via articulatory features, offering a novel approach to EMG-based speech synthesis. Additionally, we analyze the relationship between EMG electrode placement and articulatory feature predictability, providing knowledge-driven insights for optimizing EMG electrode configurations. The source code and decoded speech samples are publicly available.
Decoding Phone Pairs from MEG Signals Across Speech Modalities
de Zuazo, Xabier, Navas, Eva, Saratxaga, Ibon, Bourguignon, Mathieu, Molinaro, Nicola
Understanding the neural mechanisms underlying speech production is essential for both advancing cognitive neuroscience theory and developing practical communication technologies. In this study, we investigated magnetoencephalography signals to decode phones from brain activity during speech production and perception (passive listening and voice playback) tasks. Using a dataset comprising 17 participants, we performed pairwise phone classification, extending our analysis to 15 phonetic pairs. Multiple machine learning approaches, including regularized linear models and neural network architectures, were compared to determine their effectiveness in decoding phonetic information. Our results demonstrate significantly higher decoding accuracy during speech production (76.6%) compared to passive listening and playback modalities (~51%), emphasizing the richer neural information available during overt speech. Among the models, the Elastic Net classifier consistently outperformed more complex neural networks, highlighting the effectiveness of traditional regularization techniques when applied to limited and high-dimensional MEG datasets. Besides, analysis of specific brain frequency bands revealed that low-frequency oscillations, particularly Delta (0.2-3 Hz) and Theta (4-7 Hz), contributed the most substantially to decoding accuracy, suggesting that these bands encode critical speech production-related neural processes. Despite using advanced denoising methods, it remains unclear whether decoding solely reflects neural activity or if residual muscular or movement artifacts also contributed, indicating the need for further methodological refinement. Overall, our findings underline the critical importance of examining overt speech production paradigms, which, despite their complexity, offer opportunities to improve brain-computer interfaces to help individuals with severe speech impairments.
Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture
Pillai, Leena G, Mubarak, D. Muhammad Noorul, Sherly, Elizabeth
Speech production is a complex sequential process which involve the coordination of various articulatory features. Among them tongue being a highly versatile active articulator responsible for shaping airflow to produce targeted speech sounds that are intellectual, clear, and distinct. This paper presents a novel approach for predicting tongue and lip articulatory features involved in a given speech acoustics using a stacked Bidirectional Long Short-Term Memory (BiLSTM) architecture, combined with a one-dimensional Convolutional Neural Network (CNN) for post-processing with fixed weights initialization. The proposed network is trained with two datasets consisting of simultaneously recorded speech and Electromagnetic Articulography (EMA) datasets, each introducing variations in terms of geographical origin, linguistic characteristics, phonetic diversity, and recording equipment. The performance of the model is assessed in Speaker Dependent (SD), Speaker Independent (SI), corpus dependent (CD) and cross corpus (CC) modes. Experimental results indicate that the proposed model with fixed weights approach outperformed the adaptive weights initialization with in relatively minimal number of training epochs. These findings contribute to the development of robust and efficient models for articulatory feature prediction, paving the way for advancements in speech production research and applications.
The order in speech disorder: a scoping review of state of the art machine learning methods for clinical speech classification
Moell, Birger, Aronsson, Fredrik Sand, Östberg, Per, Beskow, Jonas
Background:Speech patterns have emerged as potential diagnostic markers for conditions with varying etiologies. Machine learning (ML) presents an opportunity to harness these patterns for accurate disease diagnosis. Objective: This review synthesized findings from studies exploring ML's capability in leveraging speech for the diagnosis of neurological, laryngeal and mental disorders. Methods: A systematic examination of 564 articles was conducted with 91 articles included in the study, which encompassed a wide spectrum of conditions, ranging from voice pathologies to mental and neurological disorders. Methods for speech classifications were assessed based on the relevant studies and scored between 0-10 based on the reported diagnostic accuracy of their ML models. Results: High diagnostic accuracies were consistently observed for laryngeal disorders, dysarthria, and changes related to speech in Parkinsons disease. These findings indicate the robust potential of speech as a diagnostic tool. Disorders like depression, schizophrenia, mild cognitive impairment and Alzheimers dementia also demonstrated high accuracies, albeit with some variability across studies. Meanwhile, disorders like OCD and autism highlighted the need for more extensive research to ascertain the relationship between speech patterns and the respective conditions. Conclusion: ML models utilizing speech patterns demonstrate promising potential in diagnosing a range of mental, laryngeal, and neurological disorders. However, the efficacy varies across conditions, and further research is needed. The integration of these models into clinical practice could potentially revolutionize the evaluation and diagnosis of a number of different medical conditions.